Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition

Abstract Stakeholders of machine learning models desire explainable artificial intelligence (XAI) to produce human-understandable and consistent interpretations. In computational toxicity, augmentation of text-based molecular representations has been used successfully for transfer learning on downstream tasks. Augmentations of molecular representations can also be used at inference to compare differences between multiple representations of the same ground truth. In this study, we investigate the robustness of eight XAI methods using test-time augmentation for a molecular-representation model in the field of computational toxicity prediction. We report significant differences between explanations for different representations of the same ground truth, and show that randomized models have similar variance. We hypothesize that text-based molecular representations in this and past research reflect tokenization more than learned parameters. Furthermore, we see a greater variance between in-domain predictions than out-of-domain predictions, indicating XAI measures something other than learned parameters. Finally, we investigate the relative importance given to expert-derived structural alerts and find similar importance given regardless of applicability domain, randomization and varying training procedures. We therefore caution future research against validating their methods using a similar comparison to human intuition without further investigation. Scientific contribution In this research we critically investigate XAI through test-time augmentation, contrasting previous assumptions about using expert validation and showing inconsistencies within models for identical representations. SMILES augmentation has been used to increase model accuracy, but was here adapted from the field of image test-time augmentation to be used as an independent indication of the consistency within SMILES-based molecular representation models.


Introduction
Explainable artificial intelligence (XAI) is used to elucidate how predictions from machine learning (ML) models are generated [1]. XAI is required to identify security and bias risks, enhance new discoveries by elucidating the reasoning behind predictions and enforce Right of Explanation policies [2]. Explanations are reported by assigning relative importance to the input variables. Notably, different XAI methods adopt different approximations to determine this importance in relation to the ML model [3][4][5][6][7]. These distinct techniques come with their individual variances, underpinned by a foundation of internal model uncertainty that forms the basis for their explanations. As these XAI methods yield outcomes with varying degrees of uncertainty [8][9][10], it raises concerns about potential distrust among end-users. Thus, in this study, we delve into how the internal model representations affect the stability of these explanations, notably in the field of toxicity prediction.
Historically, computational toxicity prediction relied on statistical methods, structural alerts, and quantitative structure-activity relationship (QSAR) models [11], often based on chemical descriptors or fingerprint representations. However, a recent surge in popularity revolves around natural language processing (NLP) used for text-based representation learning in molecular modeling, exemplified by the high accuracy of the transformer CNN [12] for Ames mutagenicity prediction [13]. The Ames test [14] serves as a screen for mutagenic compounds, and offered important advantages when it was first introduced, including requiring less test chemical, time and plastic, and being amenable to automation [15]. In NLP-based molecular representation for the prediction of Ames mutagenicity, Simplified Molecular Input Line Entry System (SMILES) [16] strings are used to represent molecules as text. These SMILES are then used within an NLP-driven machine learning (ML) model, such as sequence-to-sequence (seq2seq) [17,18]. Later state-of-the-art NLP models used augmentation strategies including masking [19,20], which also spread to NLP models for cheminformatics [12,21,22]. Furthermore, the high accuracy of the transformer CNN was partly achieved with data enumeration. Comparable to vision-based ML, where images are rotated to augment the training data, researchers used multiple SMILES strings representing the same molecule [23]. This was then used to optimize internal model representations in an autotranslation task to capture intricate structural nuances [24].
When model input can be directly translated to structural parts of a molecule, explanations from XAI methods more naturally correspond to human intuition. These XAI techniques extend their importance assessment capabilities across global and local dimensions. In the global context, the evaluation centers on the role of variables in shaping the model's overall construction [25]. In the local context, the focus shifts to the influence of variables on specific predictions, a facet that aligns with our current investigation. Local prediction elucidation strategies have emerged in the literature, including approximating complex models with simpler ML counterparts [5], or leveraging game theory to pinpoint influential input variables by comparing a selection of subsets [4]. Alternatively, researchers have harnessed the model's intrinsic architecture to unveil its decision-making processes. In deep learning, integrated gradients [3] accumulate local gradients to assess variable effects on the output. Alternatively, XAI can use inherently interpretive parts of a model. The latter has proven especially influential by using the attention mechanism of transformers, exemplified by recent work from Qiang et al. [7] that utilizes the transformer attention mechanism, which is thought to be inherently interpretable [18].
Contrasting multiple input parameters that represent the same underlying ground truth is a method to assess the stability of a model or method. This test-time data augmentation has been used in prior research conducted in domains including medical imaging, where data augmentation techniques, such as image rotation and contrast adjustment, have been used to measure model uncertainty [26,27]. These strategies have demonstrated their efficacy in enhancing model performance, gauging model uncertainty, and increasing robustness. However, this methodology remains largely unexplored within the field of XAI concerning molecular representations, presenting an opportunity to assess the robustness of XAI methods.
In this study, we delve into the challenges presented by Langer et al. [28], who found that stakeholders of ML models desire XAI methods to produce human-understandable and consistent interpretations. We focus on the robustness of local XAI methods for molecular representations, specifically by comparing importance assigned to equivalent SMILES representations. Our study has broad implications for the XAI field, given the potential impact of varying importance assignments in toxicity prediction. Our contributions can be summarized as follows:
• We investigate the influence of molecular pre-training settings on transfer learning for Ames mutagenicity.

Background
Here, we describe the mathematics behind the transformer-based, deep learning and model agnostic methods investigated in this paper. The methods utilize some common parameters and variables, including the sequence length $S$, embedding dimension $E$, number of attention heads $h$, head-specific embedding dimension $d = E/h$, and number of layers $L$. Furthermore, we define $x$ as the tokenized input sequence and the Ames prediction model as $f(x) = y$, $y \in \mathbb{R}^{2}$. XAI attributions are given as $\phi \in \mathbb{R}^{S \times E}$, where $E$ can equal $S$ depending on the XAI method. Because the dimensions of the embedding of $\phi$ can differ between methods, we average over the embedding space to obtain $\phi_i = \frac{1}{n} \sum_{j=0}^{n} \phi_{i,j}$, which means that $\phi_i$, $i \in (1, \ldots, S)$, is consistent between methods.

Transformer-based interpretation
Methods that depend on the transformer architecture primarily make use of the attention mechanism. We here redefine the methods proposed by Qiang et al. [7]. Vaswani et al. [18] defined the multi-head attention $\alpha^{l}_{h} \in \mathbb{R}^{h \times S \times d_h}$ of layer $l \in (0, \ldots, L)$ (Eq. 1). Later usage of $\alpha_{i,j}$ corresponds to the values averaged over all heads. Furthermore, the output of each layer in the encoder, denoted $o^{l} \in \mathbb{R}^{S \times E}$, is defined in Eq. 2.
In Eqs. 1 and 2, $(Q, K, V) \in \mathbb{R}^{h \times S \times d}$ represent the projection matrices of query, key and values respectively, and $W^{O} \in \mathbb{R}^{E \times E}$ represents the model weight matrix of the projection in multi-head attention. Bias is left out in the definitions for simplicity.
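Eqs. 1 and 2 follow the standard multi-head attention formulation of Vaswani et al. [18]; the sketch below uses the symbols defined above, and the exact notation is an assumption rather than a verbatim reproduction of the original equations:

$\alpha^{l}_{h} = \mathrm{softmax}\!\left(\frac{Q_{h} K_{h}^{\top}}{\sqrt{d}}\right)$  (1)

$o^{l} = \mathrm{concat}\!\left(\alpha^{l}_{1} V_{1}, \ldots, \alpha^{l}_{h} V_{h}\right) W^{O}$  (2)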
Firstly, [29] proposed to use the raw attention directly as an explanation of the entire model (Eq. 3). Usually the last layer of the encoder is used in this approach. Alternatively, the raw attentions of all layers of the model can be multiplied together and aggregated to form one explanation (attention rollout), as defined in Eq. 4.
In Eq. 4, $I_S \in \mathbb{R}^{S \times S}$ is the identity matrix.
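As a sketch, the raw attention explanation of Eq. 3 takes the (head-averaged) attention of a single layer, typically the last, and the rollout of Eq. 4 multiplies the identity-augmented attentions over all layers; any additional normalization used in the original is omitted here and should be treated as an assumption:

$\phi^{\text{Attention Maps}} = \alpha^{L}$  (3)

$\phi^{\text{Rollout}} = \prod_{l=0}^{L} \left(\alpha^{l} + I_S\right)$  (4)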
Qiang et al. [7] further expanded research by Selvaraju et al. [30] by using the class-activated gradients. Here, class-activated gradients are denoted using $\nabla_{\omega}\, y^{c}$, with $\omega$ being the position with respect to which the gradients are calculated and $y^{c}$ the index of the output corresponding to the class of interest $c$ (in our case toxic or non-toxic). This was then used in the interpretation method of the attention layers (Eq. 5). Firstly, similarly to the attention maps, the last layer or a specific layer can be used to explain the entire model. Furthermore, [7] expanded this by summing over all layers and combining the gradients with the raw attention (Eq. 6). In Eq. 6, $\odot$ represents the Hadamard product.
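Based on this description and the formulation of Qiang et al. [7], Eqs. 5 and 6 can be sketched as the class-activated gradients of the attention (Grads) and their layer-summed Hadamard product with the raw attention (AttGrads); the exact layer indexing is an assumption:

$\phi^{\text{Grads}} = \nabla_{\alpha^{l}}\, y^{c}$  (5)

$\phi^{\text{AttGrads}} = \sum_{l=0}^{L} \alpha^{l} \odot \nabla_{\alpha^{l}}\, y^{c}$  (6)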
Finally, [7] have defined rules to use the outputs of the attention layers instead of the attention itself. Here, they combine the outputs $o$ with the class-activated gradients with respect to $o$, $\nabla_{o^{l}}\, y^{c}$ (Eq. 7), and together with the attention (Eq. 8).
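Analogously, a hedged sketch of Eqs. 7 and 8 combines the layer outputs with their class-activated gradients (CAT) and additionally weights these by the attention, summed over layers (AttCAT); the summation and weighting shown here are assumptions based on the description above:

$\phi^{\text{CAT}} = o^{l} \odot \nabla_{o^{l}}\, y^{c}$  (7)

$\phi^{\text{AttCAT}} = \sum_{l=0}^{L} \alpha^{l} \odot \left( o^{l} \odot \nabla_{o^{l}}\, y^{c} \right)$  (8)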

Deep learning-based interpretation
Methods that are dependent on deep learning, but not specifically transformer-based, usually depend on the calculation of gradients for their interpretation. The most straightforward way is to use the basic full gradients over the entire model. Integrated gradients (IG) contrasts the gradients of a prediction iteratively with the gradients of a background sample by integrating over the differences between the input parameters of the sample $x$ and the input parameters of the empty background $\bar{x}$ (Eq. 9).
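Eq. 9 corresponds to the standard integrated gradients formulation [3], with $\bar{x}$ the empty background described above and the integral approximated iteratively in practice:

$\phi^{\text{IG}}_{i} = \left(x_i - \bar{x}_i\right) \int_{0}^{1} \frac{\partial f\!\left(\bar{x} + t\,(x - \bar{x})\right)}{\partial x_i}\, dt$  (9)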

Model agnostic interpretation
The model agnostic methods use either inherently interpretable approximations of models [5], or other methods that perturb the input and evaluate the variation in model output. One perturbation technique is Shapley additive values (Eq. 10) [4], where game theory over parameter subsets is used to identify the relative importance of parameters.
In Eq. 10, $F$ is the set of all input variables, $\{k\}$ the variable of interest, $S$ ranges over all subsets of $F$ without $\{k\}$, and $x_S$ and $x_{S \cup \{k\}}$ are the input with the parameters of subset $S$ and $S \cup \{k\}$, respectively.
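Eq. 10 corresponds to the standard Shapley value formulation [4], sketched here with the variables defined above:

$\phi^{\text{SHAP}}_{k} = \sum_{S \subseteq F \setminus \{k\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f\!\left(x_{S \cup \{k\}}\right) - f\!\left(x_{S}\right) \right]$  (10)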

Data collection and processing
All data was gathered from Therapeutic Data Commons (TDC) (version 0.4.0) [31], which provided standardized output for the ChEMBL database (version 29) [32,33] and the Ames dataset [13], including standardized splitting. Both datasets were cleaned using RDKit (version 22.9.3) [34]. Datasets were processed to remove stereochemistry and salts; to correct invalid hybridization, conjugation, chirality, and valency; and to set correct chirality, aromaticity and chemical property flags. Corresponding canonical SMILES were then generated and duplicates were removed. Finally, datapoints that had overlap between Ames and ChEMBL were removed from the ChEMBL dataset. SMILES were tokenized based on the character representations of the string format and given both a beginning-of-sequence (BOS) and end-of-sequence (EOS) token together with optional padding (PAD) tokens to make sequences of equal length, according to the original paper from Karpov et al. [12]. Structural alerts were gathered from Kazius et al. [35] and generated using RDKit SMARTS (SMILES arbitrary target specification) representations. Identification of alerts in molecules was performed using RDKit substructure search.
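As an illustration of the cleaning and alert-matching steps described above, a minimal RDKit sketch is given below; the simplified cleaning function and the example alert SMARTS are illustrative assumptions, not the exact pipeline used in this study.

from rdkit import Chem

def clean_smiles(smiles):
    """Simplified standardization: drop stereochemistry and return the canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # invalid SMILES are discarded
        return None
    Chem.RemoveStereochemistry(mol)  # stereochemistry was removed during processing
    return Chem.MolToSmiles(mol, canonical=True)

def has_alert(smiles, alert_smarts):
    """Substructure search for an expert-derived structural alert given as SMARTS."""
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(alert_smarts)
    return mol is not None and mol.HasSubstructMatch(pattern)

# Example with a hypothetical aromatic nitro alert (illustrative, not from the Kazius set verbatim)
print(clean_smiles("O=[N+]([O-])c1ccccc1"))
print(has_alert("O=[N+]([O-])c1ccccc1", "[c][N+](=O)[O-]"))  # True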

Model architecture and training
Model training was performed using PyTorch (version 2.0.1) [36] and Lightning (version 2.0.5) [37]. Early stopping was implemented based on the validation cross-entropy loss. Full model parameters for the training are described in Table 1 and a general overview of the methods used is visualized in Fig. 1.

Representation learning
In this study, two transformer-based architectures were explored to learn the molecular representation of SMILES. The first is the encoder-decoder architecture that forms the basis of seq2seq [17] and BART [20] models. The second is the encoder-only architecture that formed the basis for the BERT model [19]. The sole difference between these architectures is that the decoder in the former is a transformer block with cross attention, whereas in the latter it is a simple multilayer perceptron.
The pre-training was based on the ChEMBL database (Table 2), where the task was auto-translation. The architecture was trained to produce the canonical SMILES representation from either a canonical SMILES, a randomized SMILES or enumerated SMILES (up to ten randomized SMILES and one canonical SMILES). Additionally, masking was performed as another way to augment the training process: 15% of the tokens in a SMILES string were selected, of which 80% were replaced by MASK tokens, 10% by random character tokens and 10% left unchanged. Furthermore, alternative embedding dimensions and pruning options were also explored.
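A rough sketch of the two augmentations described above, SMILES enumeration via RDKit's randomized atom ordering and BERT-style token masking, is given below; the character-level tokenization and MASK token handling shown here are assumptions.

import random
from rdkit import Chem

def enumerate_smiles(canonical, n=10):
    """Generate up to n randomized SMILES plus the canonical SMILES of one molecule."""
    mol = Chem.MolFromSmiles(canonical)
    randomized = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)}
    return [canonical] + sorted(randomized)

def mask_tokens(tokens, vocab, p=0.15):
    """Select ~15% of tokens; of those, 80% -> MASK, 10% -> random token, 10% unchanged."""
    out = list(tokens)
    for i in range(len(tokens)):
        if random.random() < p:
            r = random.random()
            if r < 0.8:
                out[i] = "<MASK>"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: token left unchanged
    return out

print(enumerate_smiles("CCO", n=3))
print(mask_tokens(list("c1ccccc1O"), vocab=list("CNOcn1=()")))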

Ames transfer learning
The architectures were used for transfer learning by coupling the pre-trained encoders with a small neural network, the predictor. Both the original implementation of the transformer CNN, where the predictor was a TextCNN layer with a highway unit layer, as well as a simple max pooling layer with a multilayer perceptron, were used as model variants for the transfer learning stage. Transfer learning was performed from canonical SMILES to a binary classification (toxic or non-toxic) (Table 3). During training, the pre-trained encoder was frozen to keep generalization capabilities, and the model was trained on the scaffold split as provided by TDC. In order to give an indication of variance in performance whilst still only using the scaffold split, test-time bootstrapping was performed, where the test set was sampled with replacement until full test length and prediction statistics were calculated 1000-fold to indicate variance (Tables 10, 11).
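A minimal sketch of the test-time bootstrapping described above, assuming per-sample labels and predicted probabilities are already available; the AUROC (via scikit-learn) is used here only as an example metric.

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_prob, n_boot=1000, seed=0):
    """Sample the test set with replacement until full test length, n_boot times,
    and return the mean and standard deviation of the AUROC."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUROC is undefined with a single class
            continue
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return float(np.mean(scores)), float(np.std(scores))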
Additionally, we evaluated training variations, including transfer learning using enumerated SMILES to a binary classification (enumerated), training using a randomly initialized frozen encoder, a completely randomized model without further training, and training using the random split instead of the scaffold split (Fig. 7). Details regarding data distribution differences between the data splits are found in Appendix A.

XAI methods
The XAI methods of IG and SHAP were applied using Captum [38]. All other methods were re-implemented. Methods are described in Table 4. For XAI methods that use a specific layer for their interpretation (Attention Maps, Grads), we used the last layer averaged over all attention heads. Furthermore, the original implementations of Grads and AttGrads, and CAT and AttCAT, were re-implemented using the PyTorch [36] autograd system instead of hooks as in the original implementation [7].

Statistical analysis
All attributions were obtained based on the SMILES sequences of molecules, and each token attribution is represented as $\phi_i$, with $i$ corresponding to each token in the original full string. Token attributions were analysed in normalized form, where the original attribution vector $\phi$ was divided by the absolute sum over the total string, $\hat{\phi}_i = \phi_i / \lVert \phi \rVert_1$. We understand $\phi$ to correspond to the attributions of the full tokenized string, including the original SMILES string, BOS, EOS and PAD tokens. Other analyses include analyses of components of SMILES, atoms and alerts (Table 5). As mentioned, all analysed components were first normalized with respect to the full tokenized string.
We investigated the normalized attributions by comparing distances between attributions, the entropy within the attributions, and the relative importance given to specific components of the distributions (Table 6). Cosine similarity, using the implementation from SciPy [39], was chosen as a distance measure to analyse the variation of importance over attributions and to measure the agreement of importance given to each of the tokens. Entropy was calculated as a measure of information, similar to Dabkowski and Gal [40], and relative importance is the fraction of the attribution given to a specific component of the input (Table 5), mostly to investigate the overlap between XAI information and human-derived structural alerts [35].
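As an illustration of the normalization and the three metrics, a hedged sketch using NumPy and SciPy [39] is given below; treating absolute attributions as a probability distribution for the entropy is an assumption of this sketch.

import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import entropy

def normalize(phi):
    """Normalize token attributions by the absolute sum over the full tokenized string."""
    return phi / np.abs(phi).sum()

def cosine_distance(phi_a, phi_b):
    """Disagreement between two attribution vectors for the same atoms (0 = identical direction)."""
    return cosine(phi_a, phi_b)

def attribution_entropy(phi):
    """Information contained in an attribution; absolute values are used so the vector
    can be treated as a probability distribution (an assumption)."""
    return float(entropy(np.abs(phi)))

def relative_importance(phi, component_idx):
    """Fraction of the total (absolute) attribution assigned to a component,
    e.g. the tokens of a structural alert."""
    return float(np.abs(phi[component_idx]).sum() / np.abs(phi).sum())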

Computational efficiency
To avoid unnecessary computational overhead, all models were kept small (16.5 M parameters or fewer), trained on the smaller ChEMBL dataset rather than the more standard PubChem dataset, and trained using 16-bit precision. All models can be trained using a single GPU. The longest pre-training time in our experiments was 24 h (masked enumerated encoder-decoder) using two GPUs with distributed training for faster and more efficient computing.

ChEMBL representation learning
To generate our internal representations, we first trained an attention-based encoder with an attention-based decoder (encoder-decoder) and an attention-based encoder with an MLP (encoder-only) as transformer models on the ChEMBL small molecule dataset (Table 8). Training regimens were to translate canonical to canonical (C2C), randomized to canonical (R2C) and enumerated to canonical (E2C), as well as the masked versions (e.g., ME2C). Additionally, the encoder-decoder models' character and sequence accuracy scores for the canonical to canonical versions were high in all versions except for the R2C models. The encoder-decoder R2C models were more predictive regarding the sequence accuracy without context (greedy search), but still significantly worse than other training regimens.

Ames transfer learning
To create our final prediction models, we used transfer learning of the pre-trained models to the Ames training set by freezing the encoders and replacing the decoders by either a TextCNN or an MLP. The MLP results on the scaffold split (Table 7) outperformed the transformer CNN model (Table 9). Because of this, we decided to continue our interpretation analysis with the MLP model. Pre-trained models of C2C and E2C outperformed R2C models, whilst masked models increased model statistics over unmasked models. Finally, encoder-only pre-trained models generally outperformed encoder-decoder models on the Ames transfer learning task.

Interpretation analysis
The robustness of XAI methods was examined from three different perspectives. The first was to analyse XAI methods in the same circumstance, namely the same model and the same input, to see how different the explanations given are over all methods. Figure 2a shows the different attributions from each XAI method for one randomly chosen canonical SMILES of the encoder-decoder ME2C model. Secondly, we want to investigate the variance that the internal representation gives. To further illustrate this, the heat map shows the inner variance of the method, and the robustness score is the mean value of this heat map. Figure 2b shows how the different training regimens for an encoder-decoder architecture influence the explanations of IG for one randomly chosen canonical SMILES. Again, the heat map shows the internal consistency for in-between model robustness. Finally, we investigate how input variation, all representing the same underlying ground truth, can change the importance given by IG for the encoder-decoder ME2C model. In Fig. 2c, we investigate the canonical SMILES and ten randomized SMILES and compare the internal robustness between the importance given to the atom indices. Importantly, we reshuffle the respective SMILES strings according to the RDKit canonical atom order so that each character represents the same atom when compared.
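A sketch of this reshuffling is given below, assuming per-atom attributions are provided in the order in which atoms appear in each SMILES string (RDKit numbers atoms in that order), so that canonical atom ranks can be used to align attributions across representations; the function and variable names are illustrative.

import numpy as np
from rdkit import Chem

def to_canonical_order(random_smiles, atom_attributions):
    """Reorder per-atom attributions of a (possibly randomized) SMILES so that index i
    refers to the same atom across all SMILES representations of the same molecule."""
    mol = Chem.MolFromSmiles(random_smiles)
    # RDKit atom indices follow the order in which atoms appear in the input SMILES,
    # so atom_attributions[j] is assumed to belong to RDKit atom j.
    ranks = list(Chem.CanonicalRankAtoms(mol))  # canonical rank of each RDKit atom
    ordered = np.zeros_like(atom_attributions)
    for rdkit_idx, canon_rank in enumerate(ranks):
        ordered[canon_rank] = atom_attributions[rdkit_idx]
    return ordered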

Test-time augmentation as a measure for robustness
Firstly, we analyse the difference between SMILES of the same molecules for each method and model training (Fig. 3). Overall, the cosine distance between these samples is greatest for IG, SHAP, AttGrads and AttCAT. Additionally, the distances of the attention maps and rollout attention are lowest and most variable. The Grads values seem to have the highest variability in most models. All other methods are relatively similar in all models, with the exception of the R2C and MR2C pre-trained models, where all methods have increased cosine distances in the encoder-decoder models, but less pronounced differences in the encoder-only models. Additionally, we investigate whether the distances can be explained by the difference between canonical and random SMILES. No significant differences are found between distances of canonical and a randomly chosen random SMILES in most models. Some exceptions were found, as determined by the Mann-Whitney U test (Table 12): CAT and AttCAT for encoder-only ME2C; encoder-decoder C2C AttGrads; encoder-decoder R2C Rollout and AttGrads; encoder-decoder E2C Attention Maps; encoder-decoder MC2C IG and SHAP. In these models the populations do look similar (Fig. 8), but results of these method-model combinations could possibly be explained by randomization attribution differences.
Finally, we utilize entropy as a measure of information to examine whether the in-between sample analysis was due to overall small differences between values (Fig. 9). Entropy scores of IG, SHAP, Attention Maps, Rollout attention, AttGrads and CAT were similarly high throughout all models, but the values for Grads and AttCAT were significantly lower, indicating that the test-time augmentation robustness is partially dependent on the entropy of the method itself.

Influence of ML model on XAI methods
To investigate the dependence of XAI on models, we investigate the difference in both our test-time augmentation robustness score and the overall entropy of the XAI methods between different models. Firstly, differences between the pre-trained model with AUROC of 0.798, the native model with AUROC of 0.763 and the untrained model with AUROC of 0.474 in both entropy and in-between sample variation are minimal (Fig. 4). This indicates that the in-between model analysis of these models is dependent on something other than learned parameters.
Secondly, we identify the difference between the in-domain training data and the out-of-domain test data, as well as in-domain test data from a model trained using a random split (Fig. 4). The in-domain data shows a larger difference in in-between sample distance, indicating that the XAI methods depend more on learned parameters than for the out-of-domain samples. It also indicates that larger cosine distances are not necessarily less consistent with model predictions. Finally, we also analyse whether different training techniques (enumerated) or architecture (CNN) affect cosine distances. No obvious distance changes were found (Fig. 4).
Interestingly, differences in entropy (Fig. 10) based on these variations were minor. We also further investigated whether the number of tokens explained the consistency in in-between sample cosine distances (Fig. 11), where the distances appeared to depend on the number of carbon atoms, which explains the variation.

Using test-time augmentation to improve XAI robustness
To further assess the use of test-time augmentation, we analysed the in-between model and in-between method distances (Fig. 5). Figure 5 indicates that both the in-between model distances and the in-between method distances are reduced when values are averaged using test-time augmentation. This means that values become more consistent between both methods and models when using the average over test-time augmentations. We further investigated whether this effect was because of consistency or an overall reduction in information by investigating the entropy of averaged values and canonical values (Fig. 10). There was no difference found between averaged values and canonical values, indicating that only the consistency between methods changes, not the specific attributions.

Comparison with expert-based structural alerts
Finally, we analyse the robustness of model explanations by examining the overall attribution of all tokens as opposed to tokens corresponding to expert-derived structural alerts (Fig. 6). Values of IG, SHAP and AttGrads consistently had the highest values, with mean values around 0.2, whereas Attention Maps, Rollout and CAT gave consistently similar or slightly lower relative importance. Grads and AttCAT consistently gave the lowest relative attribution to the structural alerts. Relative importance is generally consistent between randomized models, in-domain samples, training variations and test-time augmentation averaging. Similar consistent results were observed in model variations (Fig. 12).
We further analysed the overall attribution of all tokens as opposed to the SMILES tokens, atom tokens and tokens corresponding to expert-derived structural alerts of all methods for each model (Fig. 14) and all methods for each variation (Fig. 13), where similar attributions were found regardless of model or variation.

Discussion
In this research, we performed an analysis of eight XAI methods and used (1) test-time augmentation, (2) in-between model, method and sample cosine distances, (3) entropy, and (4) relative importance given to structural alerts to assess the validity of these XAI methods and XAI robustness analyses in the context of NLP-based molecular-representation models for toxicity.

The importance of tokenization in NLP-based research
Notably, we show that the overall in-between sample cosine distances of a trained model and of progressively randomly initialized models differ only minimally. This indicates that model interpretation in these cases is dependent on an aspect other than the learned model parameters. We analyse the variation with respect to the number of carbon atoms in Fig. 11, which shows that consistency is dependent on the number of carbon atoms. This indicates that tokenization is what is mostly measured by these interpretation methods. In fact, this aligns with the previous research of Zafar et al. [41], who found that interpretation of untrained models is not random in NLP-based models.
We hypothesise that the questions about the intuitive attributions of random models posed by Zafar et al. [41] can be answered by assessing the influence of the initial tokenization or architecture artifacts. This is further supported by our findings that a native encoder sometimes even outperformed pre-trained encoders during transfer learning, as well as research from Ucak et al. [42], who showed that different tokenizations can have significant effects on final performance in NLP-based molecular representation models.

Usage of human intuition to validate XAI
In this study, we also used expert-derived structural alerts for Ames mutagenicity to investigate whether the XAI overlapped with human intuition. A recent study investigated the mean attention weights given to toxicophores in Tox21 prediction and found significantly higher attention weights for toxicophore atoms than for non-toxicophore atoms [43]. In this study, we analyse the overall importance given to structural alerts and do not divide by the number of atoms. Regardless, in this study we also show that the relative importance given to structural alerts is mostly model-independent and, crucially, no different from completely randomized models. Structural alerts and case studies are often used to investigate the inner workings of ML research, but this finding urges caution when using these tactics without further investigation.

Test-time augmentation to improve XAI
Recently, [44] independently analysed the invariance and equivariance of interpretability methods. They used the bag-of-words [45] NLP model in combination with text permutation to assess the robustness of a number of feature importance methods, including IG and gradient SHAP [46]. Bag-of-words models differ from our models in that text data in bag-of-words is inherently invariant to text permutation, whereas different text representations in our approach are learned to be invariant given the same underlying molecule. Crabbé and van der Schaar postulated that "...Any interpretability method can be made invariant..., one should increase the number of samples $N_{inv}$ until the desired invariance is achieved. In this way, the method is made robust without increasing the number of calls more than necessary" [44]. This is consistent with our findings using test-time augmentation to improve XAI robustness, where we see greater in-between model and in-between method consistency when using values averaged over multiple $N_{inv}$ samples (i.e., test-time augmentations). This was even true in a model that was not inherently invariant.
However, we also identified that neither the amount of information analysed through entropy, nor the relative importance given to expert-derived structural alerts, changes when using averaged values. This indicates that test-time augmentation can be used to make XAI more invariant, but not to improve XAI attribution overall.

XAI methods and test-time augmentation for out-of-domain identification
Wang et al. [26] analysed test-time augmentation as a measure of aleatoric (data-based) uncertainty in the task of image segmentation and found it improved over baseline methods. Later, [27] used test-time augmentation to measure epistemic (model-based) uncertainty in the field of image classification, where it was again found to improve uncertainty classification. Uncertainty can be useful to identify out-of-domain predictions, but, to our knowledge, test-time augmentation has not yet been applied to that area of uncertainty prediction.
In this research, we hypothesize that test-time augmentation behaves differently for in-domain and out-of-domain predictions. It remains to be seen whether these differences can be used to determine out-of-domain predictions, especially in the field of NLP-based molecular representations. This is because we find less variation in out-of-domain in-between sample interpretations than in in-domain interpretations. This indicates that if epistemic uncertainty measures the same model effects as the XAI methods described here, it will show decreased uncertainty values for out-of-domain samples. Additionally, this result identifies the need to combine XAI explanations with applicability domain assessments to verify the explanation.

Test-time augmentation as a measure for robustness
In this study, we analysed test-time augmentation as a measure of XAI robustness. We show substantial disagreement between augmented SMILES, even when canonical SMILES show no more difference than randomized SMILES. However, we also note a number of cautionary findings, including similar disagreements in randomized models, higher disagreement in in-domain distributions and relatively consistent distribution rankings. We did observe higher variation in in-between sample distances of untrained and worse-performing models (R2C and MR2C). This finding was subsequently diminished by the finding that in-domain training set values and random split values had similarly increased variation and higher overall values of in-between sample distances. We therefore conclude that using test-time augmentation as a measure of XAI robustness is inherently valid, but requires a comparison to randomized models and the applicability domain to draw well-founded conclusions.

XAI implementation differences
Overall, we have investigated eight XAI methods, of which six are transformer-specific, one is more general to neural networks and one is a perturbation XAI method. Firstly, we note that all methods from Qiang et al. [7] were re-implemented and, crucially, the methods of Grads, AttGrads, CAT and AttCAT were re-implemented with changes to the original implementation. Namely, for the gradients of the attention ($\nabla \alpha_{i,j}$) and the attention output ($\nabla h_{i,j}$), we implemented the gradients with respect to the full outcome. We believe this to be in line with the original publication, but have not tested differences due to implementation difficulties. Due to this, investigations into Grads, AttGrads, CAT and AttCAT may be subject to change based on the implementation. We did find the methods of Grads and AttCAT to have significantly lower entropy and lower relative importance given to the structural alerts. However, the difference with respect to gradient calculations should not impact our final conclusions, as the methods still use the same underlying principles.
Furthermore, the methods of attention maps and Grads were performed using the last layer and averaged over all attention heads. Some researchers have suggested using specific layer and head combinations to explain the model based on heuristic approaches, such as Schwaller et al. [47]. We leave such analyses to future papers and stay consistent here with the methods from Qiang et al. [7]. The methods of IG and SHAP were implemented using the standard Captum [38] library and can therefore serve as proper baseline implementations.

Future investigation of XAI methods for SMILES-based representation models
In general, our findings indicate a greater need to identify what XAI methods measure, and specifically to remove any confounding background information. In our case, all methods in this context seemed to rely not on learned gradients, but likely instead on the tokenization. While this could be the ground truth explanation, randomization experiments indicated that this is not the case. We therefore advocate for approaches that properly take background into account. Notably, local XAI methods are often further refined through comparative analysis, contrasting their findings with those obtained from background or empty samples. Two of our methods included such a comparison to a background sample, namely SHAP and IG. This approach contrasted the input with padding tokens. This did increase their in-between sample distances in randomization studies, but this effect did not translate to the in-domain predictions, which saw similar distances albeit with higher variation. Interestingly, these methods were the only ones affected in the training variations of enumerated training on Ames and the CNN architecture, where other methods stayed consistent.
However, further comparisons, specifically with regard to randomization, should help improve robustness in XAI methods. Although, as [8] discussed, randomization can indicate architecture-based priors of XAI, XAI should reflect learned parameters to explain model behaviour. Methods to increase XAI robustness to randomized models can include, but are not limited to, creating models to investigate ML models, such as [10], contrasting findings to randomized models' predictions as background instead of empty samples, and counterfactual explanations, such as the recent study from Fradkin et al. [48].
Finally, we hypothesize that test-time augmentation can improve XAI methods and identify robust XAI methods, but only when XAI methods are both in-domain and measure decision-dependent parameters, not confounding information.

Conclusion
In this research, we performed an analysis of eight XAI methods and used several analyses to assess the validity of these XAI methods and XAI robustness in the context of NLP-based molecular-representation models for toxicity. We report significant differences between explanations for different representations of the same ground truth. Additionally, we show that randomized models are similarly different, indicating that the XAI methods applied to NLP-based molecular representations in this and past research reflect tokenization more than learned parameters. Interestingly, we see a greater variance between in-domain predictions than out-of-domain predictions, further supporting this hypothesis. Furthermore, we investigated the relative importance given to expert-derived structural alerts and find similar importance given regardless of applicability domain, randomization and training variation. We therefore caution future research against validating their methods using a similar comparison to human intuition without further investigation into the validity and robustness of the XAI method used. Finally, we note that test-time augmentation can be used as a measure of robustness only if used in conjunction with other XAI method analyses, and note a greater need to identify what XAI methods precisely measure before drawing conclusions.

Fig. 1
Fig. 1 Overview of the methods used throughout the research. Data augmentation is used during pre-training. Transfer learning uses the pre-trained transformer encoder together with a small neural network or CNN. Thereafter, eight XAI methods subdivided into three groups were used for interpretation

Fig. 2 Fig. 3
Fig. 2 Single instance robustness analysis of XAI methods from three different angles: in-between method, in-between representation and in-between input. a Importance given to each input parameter varied over XAI methods, given a constant canonical SMILES input and encoder-decoder masked enumerated to canonical (ME2C) model. b Importance given to each input parameter varied over representations of the encoder-decoder model, given a constant canonical SMILES input and IG XAI method. c Importance given to each input parameter varied over different SMILES representations, given a constant canonical SMILES input, IG XAI method and encoder-decoder masked enumerated to canonical (ME2C) model

Fig. 4
Fig. 4 In-between sample cosine distance of different training settings. Cosine distances of different attributions given to different SMILES representations per molecule of different XAI methods of different training settings. Training settings include baseline ME2C of the encoder-decoder pre-training architecture, untrained, frozen encoder (native), completely random model (untrained), distances of the training data (train) of baseline, distances of the test set on a random split model (random), enumerated training (enumerated) and the statistics of the CNN model (CNN). All variations had the ME2C encoder-decoder model as the initial encoder with the exception of native and untrained

Fig. 5
Fig. 5 Comparison of the canonical and averaged values of in-between model and in-between method cosine distances. Cosine distances between different encoder-decoder models for the same method (in-between model) and different methods for the same encoder-decoder model (in-between method) of canonical and averaged atom attributions

Fig. 6
Fig. 6 Relative importance given to expert-derived structural alerts. Relative token importance given to atoms corresponding to expert-derived structural alerts. Relative importance is given for canonical representations of different XAI methods of different training settings. Training settings include untrained, frozen encoder (native), completely random model (untrained), distances of the training data (train), distances of the test set on a random split model (random), enumerated training (enumerated), the statistics of the CNN model (CNN) and the averaged values of all SMILES enumerations of the ME2C model

Fig. 9 Fig. 10
Fig. 9 Entropy values of each XAI method per model. Entropy values of XAI methods of canonical representations indicating relative information in the attributions

Fig. 11 Fig. 12 Fig. 13
Fig. 11 Cosine distance as a measure of the number of carbon atoms. Variations of in-between sample cosine distances and carbon atoms, broken down to show training variations and XAI methods

Fig. 14
Fig. 14 Relative importance given to different components for experiments. Relative token importance given to SMILES, atoms and atoms corresponding to expert-derived structural alerts. Relative importance is given for canonical representations and aggregated over all XAI methods of different pre-trained representation models

Table 1
Model training details

Table 3
Ames data breakdown

Table 4
Descriptions and equations of metrics used during the analysis of the interpretations. Described are three metrics, what they measure and their equations. $\phi_i$ is the attribution of token $i$. Entropy and cosine similarity were calculated only using the atom components

Table 5
Names of components, description of components and example token strings. Described are four names with corresponding sections of the tokens analysed in statistical analyses. Most analyses use atom tokens or alerts but are relative to the full tokenized string

Table 7
AUROC, accuracy, F1, MCC, precision and recall scores of MLP models transfer learned on Ames data

Table 8
Model statistics of pre-trained transformer models. Bold values are accuracies calculated without previous token information. Models were tested on canonical SMILES to canonical SMILES (canonical) and enumerated SMILES to canonical SMILES (enumerated)

Table 9
AUROC, accuracy, F1, MCC, precision and recall scores of the TransformerCNN models transfer learned on Ames data

Table 10
AUROC, accuracy, F1, MCC, precision and recall scores with bootstrap variability of MLP models transfer learned on Ames data. Values are based on the scaffold split. ± values have been determined using 1000-fold test-time bootstrapping

Table 11
AUROC, accuracy, F1, MCC, precision and recall scores with bootstrap variability of the TransformerCNN models transfer learned on Ames data. Values are based on the scaffold split. ± values have been determined using 1000-fold test-time bootstrapping